refactor(rust): Use ObjectStore instead of AsyncRead in parquet get metadata #15069

mickvangelderen · 2024-03-14T15:05:52Z

For the record, I do like the idea of the AsyncRead and AsyncSeek and AsyncWrite abstraction. We just want to eliminate the possibility of the implementation causing network connection resets. Maybe I have to split the object store deduplication from this PR.

@ritchie46 is there any way we can determine whether this solves #14384?

tustvold · 2024-03-14T21:26:06Z

crates/polars-io/src/parquet/async_impl.rs

+}
+
+/// Asynchronously reads the files' metadata
+pub async fn fetch_metadata(


FWIW https://docs.rs/parquet/latest/parquet/arrow/async_reader/struct.MetadataLoader.html provides an implementation of this which also handles prefetch and loading page indices

Thanks for the tip @tustvold. @mickvangelderen can you take a look at that?

From what I can see, that implementation is similar. It allows passing a footer size hint to hopefully save one round trip and has tests that also check the number of fetches.

@ritchie46 what do we want to do with that?

> @ritchie46 what do we want to do with that?

In what sense? The footer size or the tests? I would favor being conservative here: rather one larger request than needing two calls. But that's my gut feeling.

For that to make sense we need a good idea of how large the footer metadata is in the common case. We are making a trade-off between latency and throughput. If we request a large amount of data so that the footer metadata is likely contained within it, we may spend time downloading bytes that we then throw away. If we request too few bytes, we need a second request and pay the round-trip latency again, on top of the transfer time. If we request exactly enough bytes, we pay the latency once and spend no throughput on bytes that we discard.

What is most efficient depends on the latency, throughput and footer byte size.
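To make the trade-off concrete, here is a rough cost model. It is only a sketch: the `Link` struct, its field names, and the linear "latency plus bytes over throughput" model are assumptions for illustration, not measurements or anything taken from this PR.

```rust
/// Hypothetical link parameters; names and units are illustrative only.
#[derive(Clone, Copy)]
pub struct Link {
    /// Round-trip latency per request, in seconds.
    pub latency_s: f64,
    /// Sustained throughput, in bytes per second.
    pub throughput_bps: f64,
}

/// Estimated time to obtain the footer metadata when the first request
/// speculatively fetches the last `prefetch` bytes of the file.
///
/// `metadata_len` is the serialized metadata size, excluding the 8-byte
/// trailer (4-byte length + 4-byte magic) at the end of a Parquet file.
/// Assumes each request costs one round trip plus linear transfer time.
pub fn estimated_fetch_time(link: Link, prefetch: u64, metadata_len: u64) -> f64 {
    const TRAILER: u64 = 8;
    let needed = metadata_len + TRAILER;
    let first = link.latency_s + prefetch as f64 / link.throughput_bps;
    if prefetch >= needed {
        // One round trip; any bytes beyond `needed` were downloaded for nothing.
        first
    } else {
        // A second round trip fetches the bytes the first request missed.
        let missing = needed - prefetch;
        first + link.latency_s + missing as f64 / link.throughput_bps
    }
}
```

Under this model, a prefetch that is slightly too large costs only a little extra transfer time, while one that is slightly too small costs a whole extra round trip. That asymmetry is why the right prefetch size depends on how latency compares with throughput on the link in question.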

Having to read files backwards is not great. It would be nice if file formats were optimized for reading rather than writing. If we take one step back, the real solution is to use a file format that can be read efficiently over a network connection.

Anyway, I am happy to reimplement the optimistic fetch. What should the pre-fetch size be? Should it be configurable (env or parameter)?
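For reference, an optimistic fetch could be planned roughly as follows. This is a sketch and not the code in this PR: `suffix_range`, `plan_footer`, and `FooterPlan` are hypothetical names, and the prefetch size is a plain parameter here. The only format fact relied on is that a Parquet file ends with a 4-byte little-endian metadata length followed by the magic bytes `PAR1`.

```rust
use std::ops::Range;

/// Magic bytes that close an (unencrypted) Parquet file.
const MAGIC: &[u8] = b"PAR1";
/// 4-byte little-endian metadata length + 4-byte magic.
const TRAILER_LEN: usize = 8;

#[derive(Debug, PartialEq, Eq)]
pub enum FooterPlan {
    /// The metadata is inside the prefetched suffix, at this index range.
    Complete(Range<usize>),
    /// A second request for this absolute byte range of the file is needed.
    NeedMore(Range<u64>),
}

/// Byte range for the first, speculative request: the last `prefetch` bytes,
/// clamped to at least the trailer and at most the whole file.
pub fn suffix_range(file_size: u64, prefetch: u64) -> Range<u64> {
    let prefetch = prefetch.max(TRAILER_LEN as u64).min(file_size);
    file_size - prefetch..file_size
}

/// Inspect the prefetched suffix and decide whether one request was enough.
pub fn plan_footer(file_size: u64, suffix: &[u8]) -> Result<FooterPlan, String> {
    if suffix.len() < TRAILER_LEN || suffix.len() as u64 > file_size {
        return Err("suffix must hold the 8-byte trailer and fit in the file".to_string());
    }
    let (body, trailer) = suffix.split_at(suffix.len() - TRAILER_LEN);
    if &trailer[4..] != MAGIC {
        return Err("missing PAR1 magic".to_string());
    }
    let metadata_len =
        u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]) as u64;
    let needed = metadata_len + TRAILER_LEN as u64;
    if needed > file_size {
        return Err("metadata length exceeds file size".to_string());
    }
    if metadata_len <= body.len() as u64 {
        // The metadata sits directly in front of the trailer.
        let start = body.len() - metadata_len as usize;
        Ok(FooterPlan::Complete(start..body.len()))
    } else {
        // Fetch exactly the metadata bytes with a second request.
        Ok(FooterPlan::NeedMore(file_size - needed..file_size - TRAILER_LEN as u64))
    }
}
```

Whether the prefetch size comes from an environment variable or a parameter does not change this logic; it only changes what is passed to `suffix_range`.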

tustvold · 2024-03-14T21:31:13Z

> For the record, I do like the idea of the AsyncRead and AsyncSeek and AsyncWrite abstraction.

FYI there's work in flight to remove the final use of these traits in ObjectStore, I'd be interested in any thoughts on this - apache/arrow-rs#5500

ritchie46 · 2024-03-15T07:47:10Z

> For the record, I do like the idea of the AsyncRead and AsyncSeek and AsyncWrite abstraction. We just want to eliminate the possibility of the implementation causing network connection resets. Maybe I have to split the object store deduplication from this PR.

I don't. It is really not clear how many calls it will lead to, as it would work for any generic async function.

ritchie46 · 2024-03-15T08:46:28Z

@ion-elgreco had a problem with this, but such issues are hard to reproduce locally. Once this PR is ready, we could ask him to take it for a spin.

mickvangelderen · 2024-03-15T09:26:13Z

> FYI there's work in flight to remove the final use of these traits in ObjectStore, I'd be interested in any thoughts on this - apache/arrow-rs#5500

I guess AsyncRead + AsyncSeek would not allow you to spawn two connections and read from different parts of the same file concurrently if you wanted to. So maybe those traits indeed don't give us the control that we need.
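A small sketch of why a range-based interface allows this. Everything here is hypothetical and synchronous for brevity: `RangeStore`, `InMemory`, and `fetch_ranges` are stand-ins, not the `object_store` API. The point is only that a request taking `&self` plus an explicit byte range carries no shared cursor, so several requests can be in flight at once, whereas a seek-then-read interface needs exclusive access to its one position.

```rust
use std::ops::Range;
use std::thread;

/// Hypothetical, synchronous stand-in for a range-based store API.
/// Each call names the bytes it wants, so no cursor state is shared.
pub trait RangeStore: Sync {
    fn get_range(&self, range: Range<usize>) -> Vec<u8>;
}

/// In-memory "object" used only to make the sketch runnable.
pub struct InMemory(pub Vec<u8>);

impl RangeStore for InMemory {
    fn get_range(&self, range: Range<usize>) -> Vec<u8> {
        self.0[range].to_vec()
    }
}

/// Fetch several ranges concurrently, one scoped thread per range.
/// This only needs `&S`; a seek-based reader would need `&mut` access
/// (or one reader per thread) because seeking mutates its position.
pub fn fetch_ranges<S: RangeStore>(store: &S, ranges: &[Range<usize>]) -> Vec<Vec<u8>> {
    thread::scope(|scope| {
        let handles: Vec<_> = ranges
            .iter()
            .cloned()
            .map(|range| scope.spawn(move || store.get_range(range)))
            .collect();
        // Join in submission order so results line up with `ranges`.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```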

Seems like a nice PR; improving features while removing more code than you're adding is often a good sign.

ion-elgreco · 2024-03-15T10:16:38Z

@ritchie46 let me know when I can try it out :)

ritchie46

Looks great Mick!

ritchie46 · 2024-03-15T11:08:01Z

crates/polars-io/src/parquet/async_impl.rs

+}
+
+/// Asynchronously reads the files' metadata
+pub async fn fetch_metadata(


> @ritchie46 what do we want to do with that?

In what sense? The footer size or the tests? I would favor being conservative here: rather one larger request than needing two calls. But that's my gut feeling.

refactor(rust): Use ObjectStore instead of AsyncRead in parquet get metadata

a2278ff

mickvangelderen requested review from ritchie46, stinodego, orlp and c-peters as code owners March 14, 2024 15:05

github-actions bot added the internal (An internal refactor or improvement) and rust (Related to Rust Polars) labels Mar 14, 2024

mickvangelderen force-pushed the deduplicate-polars-object-store branch from 7f2815e to e123b3b Compare March 14, 2024 15:20

Delete CloudReader

66fe8d7

mickvangelderen force-pushed the deduplicate-polars-object-store branch from e123b3b to 66fe8d7 Compare March 14, 2024 16:10

Make things compile for different combinations of feature flags

860ec33

mickvangelderen force-pushed the deduplicate-polars-object-store branch from baa9d56 to 860ec33 Compare March 14, 2024 16:38

tustvold reviewed Mar 14, 2024

Increase thrift max_size value

1c0a138

ritchie46 approved these changes Mar 15, 2024

ritchie46 merged commit 4933040 into pola-rs:main Mar 15, 2024
18 checks passed

mickvangelderen deleted the deduplicate-polars-object-store branch March 15, 2024 13:17

mrdaulet mentioned this pull request Aug 18, 2024

pl.read_parquet cannot read signed GCS signed url on 0.20, but can on <0.19 #14908

Closed

2 tasks
